perf: Optimize concat()/concat_ws() UDFs#20317
Conversation
|
Benchmark results: |
This commit implements three optimizations: * In `StringViewArrayBuilder`, we re-allocated `block` after every call to `append_offset`. It is cheaper to instead clear and re-use `block`. * In `StringViewArrayBuilder::write()`, we re-validated that a string array consists of valid UTF8 characters. This was unnecessary work and can be skipped. * In the concat() UDF implementation, we miscalculated the initial size of the StringViewArrayBuilder buffer. This didn't lead to incorrect behavior but it resulted in unnecessarily needing to reallocate the buffer.
0eac046 to
992b569
Compare
|
@Jefffrey Is this okay to land in |
| let string_array = as_string_view_array(array)?; | ||
|
|
||
| data_size += string_array.len(); | ||
| data_size += string_array.total_buffer_bytes_used(); |
There was a problem hiding this comment.
This does not include the inlined strings (shorter than 12 bytes). An array of only short strings will report 0.
Is this a problem ?!
If YES then let's use something like:
self.views().iter().sum().min(usize::MAX) as usizeThere was a problem hiding this comment.
Fair point.
On your suggested fix: summing the entire 128 bit view seems wrong (we don't want to add the buffer indexes and offsets, etc.). I think it would work to just use the low 32 bits of each view, but I'm leery of depending on an Arrow implementation detail like that.
We could iterate over the strings and sum their length, but that seems like overkill for a buffer sizing hint.
How about:
- Use total_buffer_bytes_used() for now; this is just a capacity hint, so an estimate is okay
- Add a comment for the short-strings case
- Open an issue in Arrow to have them add a helper for this use-case, as it seems reasonably useful in general
There was a problem hiding this comment.
My suggestion is similar to the impl at https://github.com/apache/arrow-rs/blob/2f40f78e4feae3aee261d9608cede9535e1429e0/arrow-array/src/array/byte_view_array.rs#L685. It just does not filter out the small strings and uses u128 for the sum before the final cast to usize.
There was a problem hiding this comment.
The code in Arrow casts *v to u32, which takes the low 32 bits. Summing all 128 bit values and then taking the min of that value and usize::MAX does not seem to do the right thing, unless I'm misunderstanding completely.
In any case, I'd prefer to not depend on Arrow implementation details, if we can avoid it.
|
I think this should be good to merge once the conflict is resolved; it would be nice indeed if we had a unified way of estimating (whether accurate or not) view size, whether upstream or in a common function in DataFusion but that can be a follow up |
|
@Jefffrey Great! I resolved the merge conflict. |
|
Thanks @neilconway, @alamb & @martin-g |
Which issue does this PR close?
Rationale for this change
Faster is better.
What changes are included in this PR?
This commit implements three optimizations:
In
StringViewArrayBuilder, we recreatedblockafter every call toappend_offset. It is cheaper to instead clear and re-useblock.In
StringViewArrayBuilder::write(), we re-validated that a string array consists of valid UTF8 characters. This was unnecessary work and can be skipped.In the concat() UDF implementation, we miscalculated the initial size of the StringViewArrayBuilder buffer. This didn't lead to incorrect behavior but it resulted in unnecessarily needing to reallocate the buffer.
Are these changes tested?
Yes; no additional test cases warranted.
Are there any user-facing changes?
No.